@amogh-jahagirdar
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Jul 19, 2024

This change implements the SupportsRecoveryOperations mixin for S3FileIO. The implementation is a best-effort recovery: it first determines the latest object version and, if one is found, recovers the object by performing an s3#copy operation.

This can be used as a primitive in the repair procedures that the community is working on, or it can simply work standalone in someone's own logic if they want to correct their table state by recovering files which may have been accidentally removed (assuming their bucket has versioning enabled).
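
To make the recovery flow described above concrete, here is a minimal sketch that models a versioned bucket in memory. It is not the PR's implementation and does not use the AWS SDK: `VersionedBucketSketch`, `Entry`, and every method name are hypothetical stand-ins. It assumes that a delete on a versioned bucket only adds a delete marker, and that copying an older version creates a new current version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for a versioned bucket; not the S3 API.
class VersionedBucketSketch {
  // One entry per version; a null payload models a delete marker.
  record Entry(int versionId, String data) {
    boolean isDeleteMarker() {
      return data == null;
    }
  }

  private final Map<String, List<Entry>> history = new HashMap<>();
  private int nextVersionId = 1;

  // Every write appends a new version instead of overwriting.
  void put(String key, String data) {
    history.computeIfAbsent(key, k -> new ArrayList<>()).add(new Entry(nextVersionId++, data));
  }

  // A delete hides the object behind a marker; older versions are kept.
  void delete(String key) {
    history.computeIfAbsent(key, k -> new ArrayList<>()).add(new Entry(nextVersionId++, null));
  }

  // Reads only see the newest entry; a delete marker means "not found".
  Optional<String> get(String key) {
    List<Entry> entries = history.getOrDefault(key, List.of());
    if (entries.isEmpty() || entries.get(entries.size() - 1).isDeleteMarker()) {
      return Optional.empty();
    }
    return Optional.of(entries.get(entries.size() - 1).data());
  }

  // Best-effort recovery: find the newest real version and copy it on top.
  boolean recover(String key) {
    List<Entry> entries = history.getOrDefault(key, List.of());
    for (int i = entries.size() - 1; i >= 0; i--) {
      Entry candidate = entries.get(i);
      if (!candidate.isDeleteMarker()) {
        put(key, candidate.data()); // models the copy of a specific version
        return true;
      }
    }
    return false; // no version found, nothing to recover
  }

  int versionCount(String key) {
    return history.getOrDefault(key, List.of()).size();
  }
}
```

In this model, recovering a deleted key makes it readable again and adds one more entry to its history, while recovering a key that never existed simply returns false.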

@amogh-jahagirdar amogh-jahagirdar force-pushed the supports-recovery-mixin-s3-impl branch from 3d59a66 to 6554f38 Compare August 7, 2024 05:39
@github-actions github-actions bot removed the API label Aug 7, 2024
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review August 7, 2024 05:40
@amogh-jahagirdar amogh-jahagirdar changed the title from "API, AWS: Add SupportsRecoveryOperations mixin for FileIO and implementation for S3FileIO" to "AWS: Implement SupportsRecoveryOperations for S3FileIO" Aug 7, 2024
@amogh-jahagirdar
Contributor Author

cc @singhpk234 @geruh @rahil-c

Comment on lines +112 to +116
s3.putBucketVersioning(
    PutBucketVersioningRequest.builder()
        .bucket(bucketName)
        .versioningConfiguration(
            VersioningConfiguration.builder().status(BucketVersioningStatus.ENABLED).build())
Contributor Author

The simplest approach for the integration tests seemed to be to enable versioning on the bucket and then, on cleanup, delete every object version so that no garbage is left over.

Contributor

@rdblue rdblue left a comment

Looks good to me. Thanks, @amogh-jahagirdar!

@amogh-jahagirdar amogh-jahagirdar force-pushed the supports-recovery-mixin-s3-impl branch from 6554f38 to ea79984 Compare August 8, 2024 03:39
ListObjectVersionsIterable response =
    client()
        .listObjectVersionsPaginator(
            ListObjectVersionsRequest.builder()
Contributor

nit: we're using a mixture of patterns (builder/request object). Could we just switch this to:

builder -> builder.bucket(location.bucket()).prefix(location.key())

like we do below?
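
For readers unfamiliar with the two call styles being compared, the sketch below shows them side by side. The types here (`BuilderStyleSketch`, `ListVersionsRequest`, `describe`) are hypothetical stand-ins written for illustration, not the AWS SDK classes; the only assumption carried over from the thread is that a method can accept either a fully built request or a lambda that configures a builder.

```java
import java.util.function.Consumer;

class BuilderStyleSketch {
  // Hypothetical request type standing in for an SDK request; not the AWS class.
  record ListVersionsRequest(String bucket, String prefix) {
    static Builder builder() {
      return new Builder();
    }

    static class Builder {
      private String bucket;
      private String prefix;

      Builder bucket(String value) {
        this.bucket = value;
        return this;
      }

      Builder prefix(String value) {
        this.prefix = value;
        return this;
      }

      ListVersionsRequest build() {
        return new ListVersionsRequest(bucket, prefix);
      }
    }
  }

  // Style 1: the caller builds the request object explicitly.
  static String describe(ListVersionsRequest request) {
    return request.bucket() + "/" + request.prefix();
  }

  // Style 2: the caller passes a lambda that configures a fresh builder.
  static String describe(Consumer<ListVersionsRequest.Builder> configure) {
    ListVersionsRequest.Builder builder = ListVersionsRequest.builder();
    configure.accept(builder);
    return describe(builder.build());
  }
}
```

Both `describe(ListVersionsRequest.builder().bucket("b").prefix("p").build())` and `describe(builder -> builder.bucket("b").prefix("p"))` end up with the same request; the second form just drops the `builder()` and `build()` calls, which is the consistency the nit asks for.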


Optional<ObjectVersion> latestVersion =
    response.versions().stream().max(Comparator.comparing(ObjectVersion::lastModified));
return latestVersion.map(version -> recoverObject(version, location.bucket())).orElse(false);
Contributor

You might want to put a check here for version.isLatest(), just to protect against being called on an object that is already at its latest version. It's unlikely, but this could actually recover an older version, since there is no direct check here.

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Aug 8, 2024

I think this might be my poor naming: this latestVersion is the version we want to recover to.
For the "isLatest()" check, S3 takes deletion markers into account.

isLatest will only be true for the deletion marker version in the recovery case.

The version that we want to recover to is not the latest version by S3's definition, which considers deletion markers; it's the version just before that, which the current check finds based on the last-modified time of the versions (deletion markers are not included in response.versions()).

I'll call this variable recoveryVersion and add an inline comment.

Contributor

I think we can skip the recovery (i.e., the copy) if recoveryVersion.isLatest() is true as well. This helps in the case where we are trying to recover an object that has not been deleted yet and so has no deletion marker.

Contributor Author

Yeah, that seems like a simple check we can do to avoid unnecessary copies, updated! I didn't go down the route of comparing the latest deletion marker with the version to recover to, because inverting the logic to just verify that the version to recover to is not the latest is simple, and we know that the "version to recover" cannot represent a deletion marker due to the guarantees provided by the S3 API.
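
The decision logic discussed in this thread can be sketched in isolation as follows. This is an illustration under stated assumptions rather than the merged code: `RecoveryDecisionSketch`, `Version`, and `Action` are hypothetical names, and the input list models `response.versions()`, which per the discussion above never contains deletion markers.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class RecoveryDecisionSketch {
  // Hypothetical stand-in for the object-version fields used in the thread.
  record Version(String versionId, Instant lastModified, boolean isLatest) {}

  enum Action {
    NOTHING_TO_RECOVER, // no versions exist for the key
    ALREADY_CURRENT, // newest real version is still the latest, skip the copy
    COPY_VERSION // a deletion marker is on top, copy the newest real version
  }

  // The version to recover to is the most recently modified real version.
  static Optional<Version> recoveryVersion(List<Version> versions) {
    return versions.stream().max(Comparator.comparing(Version::lastModified));
  }

  // If that version is flagged as latest, no deletion marker hides it,
  // so a copy would only create a redundant extra version.
  static Action decide(List<Version> versions) {
    return recoveryVersion(versions)
        .map(version -> version.isLatest() ? Action.ALREADY_CURRENT : Action.COPY_VERSION)
        .orElse(Action.NOTHING_TO_RECOVER);
  }
}
```

When a deletion marker sits on top of the history, none of the listed versions has `isLatest` set, so the sketch chooses `COPY_VERSION`; for an object that was never deleted, the newest version is flagged as latest and the copy is skipped.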

Contributor

@danielcweeks danielcweeks left a comment

Minor comments.

@amogh-jahagirdar amogh-jahagirdar force-pushed the supports-recovery-mixin-s3-impl branch 2 times, most recently from c4c8771 to 6e3be14 Compare August 8, 2024 18:35
@amogh-jahagirdar amogh-jahagirdar force-pushed the supports-recovery-mixin-s3-impl branch from 6e3be14 to c1124e1 Compare August 8, 2024 18:38
Contributor

@singhpk234 singhpk234 left a comment

LGTM, thanks @amogh-jahagirdar for taking this up!

Comment on lines +447 to +449
if (version.isLatest()) {
  return true;
}
Contributor

[optional] We could add an integ test verifying that trying to restore an existing object doesn't lead to creating a new version of the object.

Contributor Author

Missed this. Sure, I can do a follow-up to verify that we skip the recovery if the latest version is not a deletion marker.


5 participants